Overview
Your efforts can help Make Data Count (MDC). Scientific data are critically undervalued even though they provide the basis for discoveries and innovations. We aim to improve our understanding of the links between scientific papers and the data used in those studies: what, when, and how they are mentioned. This will help to establish the value and impact of open scientific data for reuse. The current version of the MDC Data Citation Corpus is an aggregation of links between data and papers, but these links are incomplete: they only cover a fraction of the literature and do not provide context on how the data were used. The outcome of this competition is a highly performant model that will continuously run on scientific literature to automate the addition of high-quality and contextualized data-to-paper connections into the MDC Data Citation Corpus.

Description
Goal of the Competition
You will identify all the data citations (references to research data) from the full text of scientific literature and tag the type of citation (primary or secondary):

Primary - raw or processed data generated as part of the paper, specifically for the study
Secondary - raw or processed data derived or reused from existing records or published data
Context
Make Data Count (MDC) is a global, community-driven initiative focused on establishing open standardized metrics for the evaluation and reward of research data reuse and impact. Through both advocacy and infrastructure projects, Make Data Count facilitates the recognition of data as a primary research output, promoting data sharing and reuse across data communities.

Highlighting and valuing data contributions will lead to more collaborative, transparent, and efficient science, ultimately driving innovation and progress. To ensure that this can happen, we need to connect and contextualize data, their relationship to papers, and their reuse.

Why this is not YET a solved problem
Studies have shown that most research data (~86%) remain “uncited” in the current data citation system (Peters, I., Kraker, P., Lex, E. et al., 2016, https://doi.org/10.1007/s11192-016-1887-4), which makes them very hard to identify and record. In addition, references to data are harder to identify programmatically because of the many ways they are mentioned. For example, authors may provide a full description of the data in the methods section, mention it indirectly elsewhere, or provide a formal citation in the reference list. Authors may also use variable language when describing the relationship between the data and the paper, such as stating that their data are openly available (“publicly available”) or that the data were obtained from elsewhere (for instance, “obtained from”).

Potential Impact
The winning Kaggle model will enable MDC to update, release, and maintain an open, comprehensive and high quality set of references to data in scientific papers. The subsequent corpus will be made freely available for use by research communities, allowing for the development of better tools, a better understanding of how data are reused, improved ways of capturing broader researcher outputs, and a shift towards valuing data.

Evaluation
In this competition, the metric is F1-Score. The F1-score, commonly used in information retrieval, measures accuracy using the statistics precision (p) and recall (r). Precision is the ratio of true positives (tp) to all predicted positives (tp + fp). Recall is the ratio of true positives to all actual positives (tp + fn). The F1 score is given by:

$$F_1 = 2 \cdot \frac{p \cdot r}{p + r}$$

where:

$$p = \frac{tp}{tp + fp}, \qquad r = \frac{tp}{tp + fn}$$

The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
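As an illustration of the formula above, here is a minimal sketch that scores a set of predicted (article_id, dataset_id, type) tuples against a set of labels. It assumes tuples are compared by exact match and pooled across all articles; the competition's actual scoring code is not shown on this page, so the function name f1_score and the toy article IDs "a1" and "a2" are hypothetical.

```python
def f1_score(predicted, actual):
    """F1 over sets of (article_id, dataset_id, type) tuples.

    Assumption: a prediction is a true positive only if all three
    fields match a label exactly.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and labelled
    fp = len(predicted - actual)   # predicted but not labelled
    fn = len(actual - predicted)   # labelled but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels and predictions (hypothetical article IDs).
actual = {
    ("a1", "https://doi.org/10.5061/dryad.6m3n9", "Primary"),
    ("a2", "GSE37569", "Primary"),
    ("a2", "GSE45042", "Primary"),
}
predicted = {
    ("a1", "https://doi.org/10.5061/dryad.6m3n9", "Primary"),
    # Right dataset, wrong type: one false positive and one false negative.
    ("a2", "GSE37569", "Secondary"),
}
score = f1_score(predicted, actual)  # tp=1, fp=1, fn=2
```

Under these assumptions a correct identifier with the wrong type is penalized twice, once as a false positive and once as a false negative, so type classification matters as much as identifier extraction.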

Submission File
You must identify data references contained in the test dataset. These predictions form unique tuples of (article_id, dataset_id, type). If an article contains multiple references to the same dataset_id and type, it should only be predicted once. Only articles containing data references should be included in submissions. Articles with no data references should not be included in the submission, and will be penalized as false positives if they are. When mining the research paper full text, the DOIs may appear with or without the 'https://doi.org' stem. Convert all DOIs into the full DOI format in the submission (https://doi.org/[prefix]/[suffix]). The file should contain a header and use the following format:

row_id,article_id,dataset_id,type
0,10.1002_cssc.202201821,https://doi.org/10.5281/zenodo.7074790,Primary
1,10.1002_esp.5090,CHEMBL1097,Secondary
...
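The rules above (full DOI format, one row per unique tuple, a header row) can be sketched as follows. This is a minimal illustration, not an official helper: the function names are hypothetical, the list of DOI prefixes to strip ("doi:", "http://dx.doi.org/", and so on) is an assumption based on the examples on this page, and trimming trailing punctuation is a guess about how DOIs appear in running text.

```python
import csv
import io
import re

DOI_STEM = "https://doi.org/"
# Assumed forms a DOI may take in article text before normalization.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_dataset_id(raw):
    """Return DOIs in full https://doi.org/ form; leave accession IDs untouched."""
    value = raw.strip()
    bare = _DOI_PREFIX.sub("", value)
    if bare.startswith("10."):
        # Assumption: trailing sentence punctuation is not part of the DOI.
        return DOI_STEM + bare.rstrip(".,;")
    return value


def build_rows(predictions):
    """Deduplicate (article_id, dataset_id, type) tuples, keeping first-seen order."""
    seen, rows = set(), []
    for article_id, dataset_id, citation_type in predictions:
        key = (article_id, normalize_dataset_id(dataset_id), citation_type)
        if key not in seen:
            seen.add(key)
            rows.append(key)
    return rows


def write_submission(rows, handle):
    """Write the header and one numbered row per unique tuple."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["row_id", "article_id", "dataset_id", "type"])
    for row_id, row in enumerate(rows):
        writer.writerow([row_id, *row])


predictions = [
    ("10.1002_cssc.202201821", "doi:10.5281/zenodo.7074790", "Primary"),
    # Same dataset written with the full stem: a duplicate after normalization.
    ("10.1002_cssc.202201821", "https://doi.org/10.5281/zenodo.7074790", "Primary"),
    ("10.1002_esp.5090", "CHEMBL1097", "Secondary"),
]
rows = build_rows(predictions)
buffer = io.StringIO()
write_submission(rows, buffer)
submission_text = buffer.getvalue()
```

In a notebook, the same write_submission call could target a file opened as submission.csv instead of an in-memory buffer.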


Prizes
1st Place - $40,000
2nd Place - $20,000
3rd Place - $17,000
4th Place - $13,000
5th Place - $10,000
Competition prizes are kindly sponsored by The Navigation Fund and Chan Zuckerberg Initiative.

Make Data Count
Make Data Count is a global, community-led initiative focused on the development of open data assessment metrics. Driven by partners at academic institutions and non-profit research infrastructure providers, MDC has worked for over a decade to develop technical infrastructure and standardized approaches for assessing data usage, produce evidence-based studies on researcher behavior in citing and using data, and to drive a community of practice around responsible and meaningful evaluation of data reuse. MDC has long been supported by philanthropic organizations, in-kind support from partner institutions, and contributions from community partners.

The competition is sponsored by MDC's fiscal home, DataCite International Data Citation Initiative e.V., with prize funds from The Navigation Fund and Chan Zuckerberg Initiative.

Code Requirements


Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 9 hours run-time
GPU Notebook <= 9 hours run-time
Internet access disabled
Freely & publicly available external data is allowed, including pre-trained models
Submission file must be named submission.csv
Please see the Code Competition FAQ for more information on how to submit, and review the code debugging doc if you encounter submission errors.

Citation
Make Data Count, Maggie Demkin, and Walter Reade. Make Data Count - Finding Data References. https://kaggle.com/competitions/make-data-count-finding-data-references, 2025. Kaggle.

Dataset Description
Data Overview
In this competition, participants will extract all research data referenced in a scientific paper (by their identifier) and classify it based on its context as a primary or secondary citation.

Paper and Dataset Identifiers
Each object (paper and dataset) has a unique, persistent identifier to represent it. In this competition there will be two types:

DOIs are used for all papers and some datasets. They take the following form: https://doi.org/[prefix]/[suffix]. Examples:
https://doi.org/10.1371/journal.pone.0303785
https://doi.org/10.5061/dryad.r6nq870
Accession IDs are used for some datasets. They vary in form by individual data repository where the data live. Examples:
"GSE12345" (Gene Expression Omnibus dataset)
"PDB 1Y2T" (Protein Data Bank dataset)
"E-MEXP-568" (ArrayExpress dataset)
Files
train/{PDF,XML} - the training articles, in PDF and XML format
IMPORTANT: Not all PDF articles have a corresponding XML file (approx. 75% do)
test/{PDF,XML} - the test articles, in PDF and XML format
The rerun test dataset has approximately 2,600 articles.
train_labels.csv - labels for the training articles
article_id - research paper DOI, which will be located in the full text of the paper
dataset_id - the identifier (DOI or accession ID) of the dataset cited in the paper
type - citation type
Primary - raw or processed data generated as part of this paper, specifically for this study
Secondary - raw or processed data derived or reused from existing records or published data
sample_submission.csv - a sample submission file in the correct format
The full texts of the scientific papers were downloaded in PDF and XML format from the Europe PMC open access subset.
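Because only some PDF articles have a matching XML file, it can help to pair the two formats up front and fall back to PDF parsing where XML is missing. The sketch below assumes files are named after the article ID (for example 10.1002_esp.5090.pdf and 10.1002_esp.5090.xml) with lowercase extensions; that naming is inferred from the article_id values in the submission example, not stated on this page. The function names are hypothetical, and the demo builds a throwaway directory rather than reading the real data.

```python
import tempfile
from pathlib import Path


def pair_articles(split_dir):
    """Map article_id -> (pdf_path, xml_path or None) for one split directory."""
    split_dir = Path(split_dir)
    pdfs = {p.stem: p for p in (split_dir / "PDF").glob("*.pdf")}
    xmls = {p.stem: p for p in (split_dir / "XML").glob("*.xml")}
    return {aid: (pdf, xmls.get(aid)) for aid, pdf in sorted(pdfs.items())}


def xml_coverage(pairs):
    """Fraction of PDF articles that also have an XML file."""
    if not pairs:
        return 0.0
    return sum(1 for _, xml in pairs.values() if xml is not None) / len(pairs)


# Demo on a temporary layout that mimics train/{PDF,XML}.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train"
    (root / "PDF").mkdir(parents=True)
    (root / "XML").mkdir()
    for name in ("10.1002_cssc.202201821", "10.1002_esp.5090"):
        (root / "PDF" / f"{name}.pdf").touch()
    (root / "XML" / "10.1002_esp.5090.xml").touch()

    pairs = pair_articles(root)
    coverage = xml_coverage(pairs)
    missing_xml = [aid for aid, (_, xml) in pairs.items() if xml is None]
```

Articles listed in missing_xml are the ones that would need a PDF text extractor in this setup.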

Data Citation Mining Examples
To illustrate how research data are mentioned in the scientific literature, here are some examples:
Note: in the text, the dataset identifier may appear with or without the 'https://doi.org' stem.

Paper: https://doi.org/10.1098/rspb.2016.1151
Data: https://doi.org/10.5061/dryad.6m3n9
In-text span: "The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9."
Citation type: Primary
Paper: https://doi.org/10.1098/rspb.2018.1563
Data: https://doi.org/10.5061/dryad.c394c12
In-text span: "Phenotypic data and gene sequences are available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.c394c12"
Citation type: Primary
Paper: https://doi.org/10.1534/genetics.119.302868
Data: https://doi.org/10.25386/genetics.11365982
In-text span: "The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. Supplemental material available at figshare: https://doi.org/10.25386/genetics.11365982."
Citation type: Primary
Paper: https://doi.org/10.1038/sdata.2014.33
Data: GSE37569, GSE45042, GSE28166
In-text span: "Primary data for Agilent and Affymetrix microarray experiments are available at the NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) under the accession numbers GSE37569, GSE45042 , GSE28166"
Citation type: Primary
Paper: https://doi.org/10.12688/wellcomeopenres.15142.1
Data: pdb 5yfp
In-text span: “Figure 1. Evolution and structure of the exocyst. A) Cartoon representing the major supergroups, which are referred to in the text. The inferred position of the last eukaryotic common ancestor (LECA) is indicated and the supergroups are colour coordinated with all other figures. B) Structure of trypanosome Exo99, modelled using Phyre2 (intensive mode). The model for the WD40/b-propeller (blue) is likely highly accurate. The respective orientations of the a-helical regions may form a solenoid or similar, but due to a lack of confidence in the disordered linker regions this is highly speculative. C and D) Structure of the Saccharomyces cerevisiae exocyst holomeric octameric complex. In C the cryoEM map (at level 0.100) is shown and in D, the fit for all eight subunits (pdb 5yfp). Colours for subunits are shown as a key, and the orientation of the cryoEM and fit are the same for C and D. All structural images were modelled by the authors from PDB using UCSF Chimera.”
Citation type: Secondary
Paper: https://doi.org/10.3389/fimmu.2021.690817
Data: E-MTAB-10217, PRJE43395
In-text span: “The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ebi.ac.uk/arrayexpress/, E-MTAB-10217 and https://www.ebi.ac.uk/ena, PRJE43395.”
Citation type: Secondary
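One possible starting point, given examples like those above, is to scan the full text for identifier-shaped strings and keep a window of surrounding text for the later Primary or Secondary decision. The sketch below is illustrative only: the regular expressions are assumptions derived from the handful of identifiers shown on this page (a generic DOI shape, GSE, E-XXXX-n, and "PDB xxxx"), they will miss other repositories, and the DOI pattern also matches paper DOIs, so results would need filtering. The names find_mentions, DOI_RE, and ACCESSION_RES are hypothetical.

```python
import re

# Assumed DOI shape: "10.", a registrant code, "/", then a non-space suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")
# Assumed accession patterns, based only on the examples in this page.
ACCESSION_RES = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "ArrayExpress": re.compile(r"\bE-[A-Z]{4}-\d+\b"),
    "PDB": re.compile(r"\bPDB\s+[0-9][A-Za-z0-9]{3}\b", re.IGNORECASE),
}


def find_mentions(text, window=150):
    """Return (kind, identifier, context) for each identifier-like match.

    context is up to `window` characters on each side of the match,
    with runs of whitespace collapsed to single spaces.
    """
    mentions = []
    patterns = [("DOI", DOI_RE), *ACCESSION_RES.items()]
    for kind, pattern in patterns:
        for match in pattern.finditer(text):
            identifier = match.group(0)
            if kind == "DOI":
                # Assumption: trailing punctuation belongs to the sentence.
                identifier = identifier.rstrip(".,;)")
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            context = " ".join(text[start:end].split())
            mentions.append((kind, identifier, context))
    return mentions


text = (
    "The data we used in this publication can be accessed from Dryad at "
    "doi:10.5061/dryad.6m3n9. Primary data are available at GEO under the "
    "accession numbers GSE37569, GSE45042 , GSE28166. In D, the fit for all "
    "eight subunits (pdb 5yfp)."
)
mentions = find_mentions(text)
found = [(kind, identifier) for kind, identifier, _ in mentions]
```

The context strings could then be passed to whatever classifier decides the citation type. Wording alone is not a reliable signal: the last example above uses "presented in this study" yet is labelled Secondary.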